Combining multiple thresholding binarization values to improve OCR output

Authors

  • William B. Lund
  • Douglas J. Kennard
  • Eric K. Ringger
Abstract

For noisy historical documents, a high optical character recognition (OCR) word error rate (WER) can render the OCR text unusable. Because image binarization is often the method used to identify foreground pixels, a significant body of research has sought to improve image-wide binarization directly. Instead of relying on any one imperfect binarization technique, our method incorporates information from multiple global threshold binarizations of the same image to improve text output. Using a new corpus of 19th-century grayscale newspaper images for which the text transcription is known, we observe WERs of 13.8% and higher using current binarization techniques and a state-of-the-art OCR engine. Our novel approach combines the OCR outputs from multiple thresholded images by aligning the text output and producing a lattice of word alternatives, from which a lattice word error rate (LWER) is calculated. Our results show an LWER of 7.6% when aligning two thresholded images, falling to 6.8% when aligning five. From the word lattice we commit to one hypothesis by applying the methods of Lund et al. (2011), achieving 8.41% WER, a 39.1% reduction in error rate relative to the performance of the original OCR engine on this data set.
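
The pipeline in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy OCR strings, and the threshold values are all hypothetical, the OCR step itself is omitted, and the outputs are assumed to be already aligned word by word (an empty string marks a gap). The abstract does not define LWER precisely; here it is assumed to be the fraction of reference words that appear in none of the alternatives at the corresponding lattice position.

```python
from typing import List, Sequence, Set


def binarize(gray: Sequence[Sequence[int]], threshold: int) -> List[List[int]]:
    """Global threshold: pixels darker than `threshold` become foreground (1)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def build_lattice(aligned_outputs: Sequence[Sequence[str]]) -> List[Set[str]]:
    """Collect the word alternatives found at each aligned position.

    Assumes every output has already been aligned to the same length,
    with "" standing in for a gap.
    """
    length = len(aligned_outputs[0])
    if any(len(out) != length for out in aligned_outputs):
        raise ValueError("outputs must be aligned to the same length")
    return [{out[i] for out in aligned_outputs} for i in range(length)]


def lattice_word_error_rate(lattice: Sequence[Set[str]],
                            aligned_reference: Sequence[str]) -> float:
    """Assumed LWER: share of reference words missing from their lattice column."""
    errors = sum(1 for alts, ref in zip(lattice, aligned_reference)
                 if ref not in alts)
    n_ref = sum(1 for ref in aligned_reference if ref)  # gaps are not counted
    return errors / n_ref


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction, e.g. from a baseline WER to an improved WER."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical OCR outputs of one line, binarized at three global thresholds.
    outputs = [
        ["the", "qu1ck", "brown", "f0x"],   # threshold 96 (made-up value)
        ["the", "quick", "brovvn", "f0x"],  # threshold 128
        ["tbe", "quick", "brovvn", "f0x"],  # threshold 160
    ]
    reference = ["the", "quick", "brown", "fox"]
    print(lattice_word_error_rate(build_lattice(outputs), reference))
    # Figures from the abstract: 13.8% baseline WER, 8.41% final WER.
    print(round(relative_reduction(13.8, 8.41), 3))
```

In this toy case each single output gets at least two of the four words wrong, while the lattice misses only the word that every threshold misreads, which is the effect the abstract reports when more thresholded images are aligned.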

Similar resources

Local Contrast and Mean Thresholding in Image Binarization

Binarization is the process of separating the pixel values of an input image into two values, such as white for background and black for foreground. It is an important part of image processing and the first step in many document analysis and OCR processes. Most binarization techniques associate a certain intensity value, called a threshold, which separates the pixel values of the concerned...

Local Contrast and Mean based Thresholding Technique in Image Binarization

Binarization is the process of separating the pixel values of an input image into two values, such as white for background and black for foreground. It is an important part of image processing and the first step in many document analysis and OCR processes. Most binarization techniques associate a certain intensity value, called a threshold, which separates the pixel values of the concerned...

Degraded Document Image Binarization Techniques

Document Image Binarization is performed in the preprocessing stage for document analysis and it aims to segment the foreground text from the document background. A fast and accurate document image binarization technique is important for the ensuing document image processing tasks such as optical character recognition (OCR) and Document Image Retrieval (DIR). This research area has been studied...

Binarization of Document Images

A binary image is a digital image that has just two possible values for each pixel. In general, two colors are used for a binary image, i.e. black and white. Binarization is one of the most important pre-processing steps, which consists of separating the foreground and background of document images. Image binarization is the method of division of pixel values into two collections, black as for...

OCR Based Thresholding

In large-scale digitization processes, several common tasks are performed to provide an electronic version of a paper document. One of the first steps is the thresholding of the image, which is necessary for the following procedures to work properly. Many binarization methods have been proposed to solve this problem, but they need to be tuned on the target document corpus to obtain best results...

Publication date: 2013